Daily Weather Data Analysis and Classification using Feature Importance

In this article, we analyze a weather dataset from Kaggle.com.

Daily Weather Dataset

Data description from Kaggle:

Loading the Dataset
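
As a minimal sketch of this step, the dataset can be read with pandas. The helper name `load_weather` and the file name in the usage comment are assumptions made for illustration; use whatever path the Kaggle download was saved to.

```python
import pandas as pd


def load_weather(path):
    """Read the weather CSV into a DataFrame and report its size."""
    df = pd.read_csv(path)
    print(f"{df.shape[0]} rows, {df.shape[1]} columns")
    return df


# Hypothetical usage; the file name below is an assumption:
# data = load_weather("daily_weather.csv")
# data.head()
```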

Features

| Column | Description |
|---|---|
| Air Pressure | Air pressure in hectopascals (1 hPa = 100 pascals) at 9 AM |
| Air Temperature | Air temperature in degrees Fahrenheit at 9 AM |
| Avg Wind Direction | Average wind direction over the minute before the timestamp, in degrees (0 is north), at 9 AM |
| Avg Wind Speed | Average wind speed over the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Max Wind Direction | Highest wind direction in the minute before the timestamp, in degrees (0 is north), at 9 AM |
| Max Wind Speed | Highest wind speed in the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Min Wind Speed | Lowest wind speed in the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Rain Accumulation | Accumulated rain in millimeters (mm) at 9 AM |
| Rain Duration | Length of time it rained, in seconds (s), at 9 AM |
| Relative Humidity (Morning) | Relative humidity in percent at 9 AM |
| Relative Humidity (Afternoon) | Relative humidity in percent at 3 PM |

For convenience, we would like to modify the feature names.
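
One way to rename the features is `DataFrame.rename` with an explicit mapping. The raw column names in `rename_map` are assumptions about the CSV headers, and the small `raw` frame holds made-up values so that the sketch runs on its own.

```python
import pandas as pd

# Assumed raw header names; the actual CSV headers may differ.
rename_map = {
    "air_pressure_9am": "Air Pressure",
    "air_temp_9am": "Air Temperature",
    "relative_humidity_9am": "Relative Humidity (Morning)",
    "relative_humidity_3pm": "Relative Humidity (Afternoon)",
}

# Stand-in frame with made-up values, used only to make the example runnable.
raw = pd.DataFrame({
    "air_pressure_9am": [918.0, 917.0],
    "air_temp_9am": [70.0, 65.0],
    "relative_humidity_9am": [40.0, 25.0],
    "relative_humidity_3pm": [35.0, 20.0],
})

# rename() returns a new frame; columns missing from the mapping are left as-is.
data = raw.rename(columns=rename_map)
print(list(data.columns))
```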

Preprocessing

Imputing Missing Values
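
A minimal sketch of imputation with scikit-learn's `SimpleImputer` follows. Filling with the column median is an assumption made for illustration, not necessarily the strategy used for this dataset, and the frame below contains made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in frame with a few missing entries (made-up values).
data = pd.DataFrame({
    "Air Pressure": [918.0, np.nan, 920.0, 917.0],
    "Air Temperature": [70.0, 65.0, np.nan, 60.0],
})

# Count the missing values in each column before imputing.
print(data.isna().sum())

# Assumption: replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(data), columns=data.columns, index=data.index
)
print(imputed)
```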

Note that

Problem Description

Let's set Relative Humidity (Afternoon) as the target variable. That is, using the rest of the features, we would like to predict whether it is humid or not at 3 PM. To do so, we take the median of Relative Humidity (Afternoon), then assign 1 to values greater than or equal to the median and 0 to values below it.
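
The median split described above can be sketched as follows. The humidity values are made up for illustration; in practice the series would be the Relative Humidity (Afternoon) column of the dataset.

```python
import pandas as pd

# Made-up afternoon humidity readings, in percent.
humidity = pd.Series(
    [20.0, 35.0, 50.0, 65.0, 80.0], name="Relative Humidity (Afternoon)"
)

median = humidity.median()

# 1 where the value is at or above the median, 0 where it is below.
target = (humidity >= median).astype(int)

print(median)
print(target.tolist())
```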

Modeling

First off, let's look at the variance of our dataset features.
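
As a small illustration, per-feature variance can be computed with `DataFrame.var()`. The two columns below hold made-up values chosen only to show how differently scaled features produce very different variances.

```python
import pandas as pd

# Made-up values: pressure varies little, wind direction spans a wide range.
data = pd.DataFrame({
    "Air Pressure": [917.0, 918.0, 919.0, 920.0],
    "Avg Wind Direction": [10.0, 100.0, 190.0, 280.0],
})

# Sample variance of each column (ddof=1 by default), largest first.
variances = data.var().sort_values(ascending=False)
print(variances)
```

Large gaps between feature variances are one reason to standardize the features before modeling.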

Furthermore, we would like to standardize features by removing the mean and scaling to unit variance. In this article, we demonstrated the benefits of scaling data using StandardScaler().
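
A minimal sketch of standardization with `StandardScaler` is shown here on a made-up frame. Each column is shifted to zero mean and scaled to unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values on very different scales.
X = pd.DataFrame({
    "Air Pressure": [917.0, 918.0, 919.0, 920.0],
    "Avg Wind Direction": [10.0, 100.0, 190.0, 280.0],
})

scaler = StandardScaler()

# fit_transform returns a NumPy array, so wrap it to keep the column names.
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

print(X_scaled.mean().round(6))
print(X_scaled.std(ddof=0).round(6))
```

When the data is split into training and test sets, the scaler is usually fitted on the training portion only and then applied to the test portion with `transform`.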

Next, we define a number of helper functions that we will use.

Decision Tree Classifier

First, let's try scikit-learn's Decision Tree Classifier.
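
The following is a sketch of fitting and scoring a `DecisionTreeClassifier`. The data is a synthetic stand-in generated with `make_classification`, and the split ratio and `max_depth=4` are assumptions chosen for illustration rather than the settings used on the weather data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ten weather features and the binary humidity target.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed hyperparameters; a shallow tree limits overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Probability of the positive class, used to compute the area under the ROC curve.
proba = tree.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

print(f"Test AUC: {auc:.3f}")
print("Feature importances:", np.round(tree.feature_importances_, 3))
```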

Random Forest Classifier

Next, let's use scikit-learn's Random Forest Classifier.
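
A comparable sketch for `RandomForestClassifier` is given below, this time also ranking the features by importance. The synthetic data, the placeholder feature names, and `n_estimators=200` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather features and the binary humidity target.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed hyperparameters.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Area under the ROC curve on the held-out split.
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

# Impurity-based importances, averaged over the trees and sorted largest first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(f"Test AUC: {auc:.3f}")
print(importances)
```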

Final Thoughts

The area under the curve (AUC) is higher for the Random Forest Classifier than for the Decision Tree Classifier, so the Random Forest Classifier performs better of the two here.